---
id: tutorial-introduction
title: Tutorial Introduction
sidebar_label: Tutorial Introduction
---

**DeepEval** is the open-source LLM evaluation framework. In this end-to-end tutorial, we'll show you how to use DeepEval to improve your LLM application one step at a time, covering how to evaluate and test it from the initial development stages all the way to post-production.

:::info
Before we begin, run the following command to set up your Confident AI account and **retrieve your API key**.

```bash
deepeval login
```

:::

For **LLM evaluation in development**, we'll cover:

- How to choose your LLM evaluation metrics and use them in `deepeval`
- How to run evaluations in `deepeval` to quantify LLM application performance
- How to use evaluation results to identify system hyperparameters (such as LLMs and prompts) to iterate on
- How to make your evaluation results more robust by scaling them out to cover more edge cases

Once your LLM application is ready for deployment, for **LLM evaluation in production**, we'll cover:

- How to continuously evaluate your LLM application in production (post-deployment, online evaluation)
- How to use evaluation data in production to A/B test different system hyperparameters (such as LLMs and prompts)
- How to use production data to improve your development evaluation workflow over time
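As a rough illustration of the A/B testing idea above, the sketch below groups evaluation scores by the hyperparameter variant that produced them and compares the averages. It is plain Python rather than `deepeval`'s API: the `mean_score_by_variant` helper, the variant names, and the scores are all made up for this example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_score_by_variant(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (variant, score) pairs and return the mean score per variant."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for variant, score in records:
        grouped[variant].append(score)
    return {variant: mean(scores) for variant, scores in grouped.items()}


# Made-up scores for illustration; real scores would come from your metrics.
production_records = [
    ("prompt-v1", 0.60),
    ("prompt-v1", 0.70),
    ("prompt-v2", 0.80),
    ("prompt-v2", 0.90),
]

averages = mean_score_by_variant(production_records)
best_variant = max(averages, key=averages.get)
print(averages, best_variant)
```

With only a handful of samples, a difference in averages can easily be noise, so a real comparison would need far more data per variant before favoring one.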

:::tip
Having your LLM application in production doesn't remove the need for LLM evaluation during development, and evaluating in development doesn't remove the need for evaluation in production.
:::

## Terminologies

Before diving into the tutorial, let's go over the terminology commonly used in LLM evaluation:

- **Hyperparameters**: the parameters that make up your LLM system, such as system prompts, user prompts, models used for generation, temperature, and chunk size (for RAG).
- **System prompt**: the prompt that sets the overarching instructions defining how your LLM should behave across all interactions.
- **Generation model**: the model that generates LLM responses from some input, and therefore the LLM being evaluated. We'll refer to it simply as the **model** throughout this tutorial.
- **Evaluation model**: the LLM used to perform the evaluation, **NOT** the LLM being evaluated.
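
To make these terms concrete, here is a minimal plain-Python sketch of how a set of hyperparameters might be recorded next to the two model roles. The `Hyperparameters` class, its field names, and the model names are hypothetical illustrations and are not part of `deepeval`'s API.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Hyperparameters:
    """Hypothetical container for the settings that define an LLM system."""

    system_prompt: str
    generation_model: str  # the model being evaluated
    temperature: float = 0.0
    chunk_size: Optional[int] = None  # only relevant for RAG pipelines


# Illustrative values only; the model names are placeholders.
baseline = Hyperparameters(
    system_prompt="You are a concise legal summarization assistant.",
    generation_model="generation-model-a",
    temperature=0.2,
)

# The evaluation model scores outputs. It is kept separate here because
# it is not the model being evaluated.
evaluation_model = "evaluation-model-b"

print(asdict(baseline))
```

Keeping the hyperparameters in one immutable record makes it easier to tell exactly which configuration produced a given set of evaluation results.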

## Which Use Cases Will Be Evaluated?

We'll be going through a few use cases in this tutorial including:

- Legal document summarization
- Medical chatbot
- RAG QA Agent

Your use case might not be any of these, and your evaluation criteria might differ, but that's OK. The concept is the same for all use cases: you pick your criteria, choose the metrics `deepeval` offers that match those criteria, and iterate based on the results of these evaluations.
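
The pick-criteria, score, and iterate loop described above can be sketched in a few lines of plain Python. This is a toy stand-in and not how `deepeval` implements its metrics: the `Example` class, the `keyword_coverage` metric, the `pass_rate` helper, the threshold, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One evaluation example: the input, the LLM output, and the criteria."""

    input: str
    actual_output: str
    required_keywords: List[str]


def keyword_coverage(example: Example) -> float:
    """Toy metric: fraction of required keywords present in the output."""
    if not example.required_keywords:
        return 1.0
    text = example.actual_output.lower()
    hits = sum(1 for kw in example.required_keywords if kw.lower() in text)
    return hits / len(example.required_keywords)


def pass_rate(
    examples: List[Example],
    metric: Callable[[Example], float],
    threshold: float,
) -> float:
    """Fraction of examples whose metric score meets the threshold."""
    if not examples:
        return 0.0
    passed = sum(1 for ex in examples if metric(ex) >= threshold)
    return passed / len(examples)


examples = [
    Example(
        input="Summarize the lease.",
        actual_output="The lease runs for 12 months and requires a $500 deposit.",
        required_keywords=["12 months", "deposit"],
    ),
    Example(
        input="Summarize the NDA.",
        actual_output="The parties agree to keep shared information confidential.",
        required_keywords=["confidential", "two years"],
    ),
]

rate = pass_rate(examples, keyword_coverage, threshold=0.7)
print(f"pass rate: {rate:.2f}")
```

Iterating then means changing a hyperparameter, such as the system prompt, regenerating the outputs, and checking whether the pass rate improves.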

## Who Is This Tutorial For?

If you're building applications powered by LLMs, this tutorial is for you. Why? Because LLMs are prone to errors, and this tutorial will teach you how to improve your LLM systems through a systematic, evaluation-guided, data-first approach.
